Robust Processing of Real-World Natural-Language Texts

Authors

  • Jerry R. Hobbs
  • Douglas E. Appelt
  • John Bear
  • Mabry Tyson
Abstract

It is often assumed that when natural language processing meets the real world, the ideal of aiming for complete and correct interpretations has to be abandoned. However, our experience with TACITUS, especially in the MUC-3 evaluation, has shown that principled techniques for syntactic and pragmatic analysis can be bolstered with methods for achieving robustness. We describe three techniques for making syntactic analysis more robust: an agenda-based scheduling parser, a recovery technique for failed parses, and a new technique called terminal substring parsing. For pragmatics processing, we describe how the method of abductive inference is inherently robust, in that an interpretation is always possible, so that in the absence of the required world knowledge, performance degrades gracefully. Each of these techniques has been evaluated, and the results of the evaluations are presented.

1 Introduction

If automatic text processing is to be a useful enterprise, it must be demonstrated that the completeness and accuracy of the information extracted is adequate for the application one has in mind. While it is clear that certain applications require only a minimal level of competence from a system, it is also true that many applications require a very high degree of completeness and accuracy in text processing, and an increase in capability in either area is a clear advantage. Therefore we adopt an extremely high standard against which the performance of a text processing system should be measured: it should recover all information that is implicitly or explicitly present in the text, and it should do so without making mistakes. This standard is far beyond the state of the art. It is an impossibly high standard for human beings, let alone machines. However, progress toward adequate text processing is best served by setting ambitious goals.
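The agenda-based scheduling parser is presented in detail later in the paper; the following is only a rough sketch of the general idea, not SRI's implementation. A bottom-up chart parser keeps pending constituents on a priority agenda (here, longer spans first) and stops when a fixed budget of scheduling steps is exhausted, so that parsing time is bounded and a partial chart survives even when no full parse is found. The grammar, lexicon, and scoring below are invented for illustration:

```python
import heapq

# Toy grammar with binary rules and a lexicon.  All names here are
# illustrative; the paper does not give TACITUS's actual grammar.
RULES = {            # (B, C) -> A, i.e. the rule A -> B C
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "saw": "V"}

def parse(words, budget=200):
    """Agenda-based bottom-up parsing: edges wait on a priority
    agenda (longer spans pop first) and work stops when the budget
    of scheduling steps runs out, returning whatever chart of
    complete or partial constituents has been built so far."""
    agenda, chart, seen = [], set(), set()
    for i, w in enumerate(words):
        heapq.heappush(agenda, (-1, (LEXICON[w], i, i + 1)))
    while agenda and budget > 0:
        budget -= 1
        _, edge = heapq.heappop(agenda)
        if edge in seen:
            continue
        seen.add(edge)
        chart.add(edge)
        cat, start, end = edge
        for (b, c), a in RULES.items():
            if cat == b:          # edge is a left child; find a right sibling
                for (c2, s2, e2) in list(chart):
                    if c2 == c and s2 == end:
                        heapq.heappush(agenda, (-(e2 - start), (a, start, e2)))
            if cat == c:          # edge is a right child; find a left sibling
                for (b2, s2, e2) in list(chart):
                    if b2 == b and e2 == start:
                        heapq.heappush(agenda, (-(end - s2), (a, s2, end)))
    return chart

chart = parse("the dog saw the cat".split())
print(("S", 0, 5) in chart)   # True: a complete parse fits in the budget
```

With a much tighter budget the same call returns only the smaller constituents built so far, which is the kind of partial result a recovery strategy for failed parses could then work from.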
For this reason we believe that, while it may be necessary in the intermediate term to settle for results that are far short of this ultimate goal, any linguistic theory or system architecture that is adopted should not be demonstrably inconsistent with attaining this objective. However, if one is interested, as we are, in the potentially successful application of these intermediate-term systems to real problems, it is impossible to ignore the question of whether they can be made efficient enough and robust enough for application.

1.1 The TACITUS System

The TACITUS text processing system has been under development at SRI International for the last six years. This system has been designed as a first step toward the realization of a system with very high completeness and accuracy in its ability to extract information from text. The general philosophy underlying the design of this system is that the system, to the maximum extent possible, should not discard any information that might be semantically or pragmatically relevant to a full, correct interpretation. The effect of this design philosophy on the system architecture is manifested in the following characteristics:

  • TACITUS relies on a large, comprehensive lexicon containing detailed syntactic subcategorization information for each lexical item.
  • TACITUS produces a parse and semantic interpretation of each sentence using a comprehensive grammar of English in which different possible predicate-argument relations are associated with different syntactic structures.
  • TACITUS relies on a general abductive reasoning mechanism to uncover the implicit assumptions necessary to explain the coherence of the explicit text.

These basic design decisions do not by themselves distinguish TACITUS from a number of other natural-language processing systems.
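The abductive reasoning mechanism interprets a sentence by proving its logical form, assuming at some cost whatever cannot be proved from what is already known. The following is a minimal sketch of that cost-based idea only, not the TACITUS prover; the predicates, rules, and cost figures are invented for illustration:

```python
# Minimal sketch of cost-based abduction: establish a goal literal
# either from known facts, via a Horn rule, or by assuming it
# outright at a stated cost.  All predicates and costs are invented.
FACTS = {"lube-oil(o1)"}
RULES = {                      # head -> list of body literals
    "fluid(o1)": ["lube-oil(o1)"],
    "alarm(a1)": ["sounded(a1)"],
}
ASSUME_COST = {"sounded(a1)": 3, "fluid(o1)": 10}

def prove(goal, depth=5):
    """Return (cost, assumptions) for the cheapest way to establish
    the goal: free if it is a known fact, the cost of its body if a
    rule applies, otherwise the price of assuming it outright."""
    if goal in FACTS:
        return 0, set()
    best = (ASSUME_COST.get(goal, float("inf")), {goal})
    if depth > 0 and goal in RULES:
        cost, assumed = 0, set()
        for sub in RULES[goal]:
            c, a = prove(sub, depth - 1)
            cost += c
            assumed |= a
        if cost < best[0]:
            best = (cost, assumed)
    return best

print(prove("fluid(o1)"))   # (0, set()): proved outright from lube-oil(o1)
print(prove("alarm(a1)"))   # (3, {'sounded(a1)'}): one assumption, cost 3
```

Because any unprovable literal can still be assumed at a price, an interpretation is always available; missing world knowledge shows up as higher assumption cost rather than outright failure, which is the graceful degradation the abstract refers to.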
However, they are somewhat controversial given the intermediate goal of producing systems that are useful for existing applications. Criticism of the overall design with respect to this goal centers on the following observations:

  • The syntactic structure of English is very complex, and no grammar of English has been constructed that has complete coverage of the syntax one encounters in real-world texts. Much of the text that needs to be processed will lie outside the scope of the best grammars available, and therefore cannot be understood by a system that relies on a complete...


Similar articles

LearningPinocchio: adaptive information extraction for real world applications

The new frontier of research on Information Extraction from texts is portability without any knowledge of Natural Language Processing. The market potential is very large in principle, provided that a suitable easy-to-use and effective methodology is provided. In this paper we describe LearningPinocchio, a system for adaptive Information Extraction from texts that is having good commercial and s...


Codalab username: wenqiwooo Simple Dynamic Coattention Networks

Reading comprehension (RC), or the capability to process document texts and answer questions about them is a difficult task for machines, as human language understanding and real-world knowledge are needed [4]. This can serve a wide range of applications, from simplifying information retrieval processes to building more robust artificial intelligence. Previously, most natural language processin...


What to do about non-standard (or non-canonical) language in NLP

Real world data differs radically from the benchmark corpora we use in natural language processing (NLP). As soon as we apply our technologies to the real world, performance drops. The reason for this problem is obvious: NLP models are trained on samples from a limited set of canonical varieties that are considered standard, most prominently English newswire. However, there are many dimensions,...


Integrated Processing Produces Robust Understanding (Mallory Selfridge)

Natural language interfaces to computers must deal with wide variation in real-world input. This paper proposes that, in order to handle real-world input robustly, a natural language interface should be constructed in accord with principles of integrated processing: processing syntax and semantics at the same time, processing syntax and semantics using the same mechanisms, and processing langua...


The GE NLToolset: A Software Foundation for Intelligent Text Processing

Many obstacles stand in the way of computer programs that could read and digest volumes of natural language text. The foremost of these difficulties is the quantity and variety of knowledge about language and about the world that seems to be a prerequisite for any substantial language understanding. In its most general form, the robust text processing problem remains insurmountable; yet practic...


Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm

Sentence-level aligned parallel texts are important resources for a number of natural language processing (NLP) tasks and applications such as statistical machine translation and cross-language information retrieval. With the rapid growth of online parallel texts, efficient and robust sentence alignment algorithms become increasingly important. In this paper, we propose a fast and robust senten...



Publication date: 1992